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Abstract. We introduce a new nonlinear model for classification, in which we model the joint 
distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process 
mixtures. We keep the relationship between y and x linear within each component of the mixture. 
The overall relationship becomes nonlinear if the mixture contains more than one component. We 
use simulated data to compare the performance of this new approach to a simple multinomial logit 
(MNL) model, an MNL model with quadratic terms, and a decision tree model. We also evaluate 
our approach on a protein fold classification problem, and find that our model provides substantial 
improvement over previous methods, which were based on Neural Networks (NN) and Support 
Vector Machines (SVM). Folding classes of protein have a hierarchical structure. We extend our 
method to classification problems where a class hierarchy is available. We find that using the prior 
information regarding the hierarchical structure of protein folds can result in higher predictive 
accuracy. 



1 Introduction 



In regression and classification models, estimation of parameters and interpretation of results are 
easier if we assume a simple distributional form (e.g., normality) and regard the relationship between 
response variable and covariates as linear. However, the performance of the model obtained depends 
on the appropriateness of these assumptions. Poor performance may result from assuming wrong 
distributions, or regarding relationships as linear when they are not. In this paper, we introduce a 
new model based on a Dirichlet process mixture of simple distributions, which is more flexible to 
capture nonlinear relationships. 

A Dirichlet process, T>(Gq : 7), with baseli ne distribution Go and scale parameter 7, is a dis- 
tribution over distributions. iFerguson (jl973h introduced the Dirichlet process as a class of prior 
distributions for which the suppor t is large, and the posterio r distr ibution is manageable analyti- 
cally. Using the Polya urn scheme, iBlackwell and MacQueenl (|1973l ) showed that the distributions 



1 



sampled from a Dirichlet process are discrete almost surely. The idea of using a Dirichlet process 
as the prior for the mixing proportions of a simple distribution (e.g., Gaussian) was first introduced 



bv lAntoniakl (|l974 ). 



We will describe the Dirichlet process mixture model as a limit of finite mixture model (see iNeal 



(|2OO0h for further description). Suppose yi,...,y n are drawn independently from some unknown 



distribution. We can model the distribution of y as a mixture of simple distributions such that: 

c 



P{y) = 2> c /(i/|& 



c=l 

Here, p c are the mixing proportions, and / is a simple class of distributions, such as normal with 
4> = (n,cr). We first assume that the number of mixing components, C, is finite. In this case, a 
common prior for p c is a symmetric Dirichlet distribution: 

r(7) ' ' 



p(pu-,pc) = - Up c 



(7/C)-l 



where p c > and ^2p c = 1- Parameters <p c are assumed to be independent under the prior with 
distribu tion Go. W e can use mixture identifiers, Cj, and represent the above mixture model as 



follows (jNeail2OO0h : 



Vi\ci,(f> ~ F((f> c .) 
Ci\pi,.-,pc ~ Discrete(pi,...,p c ) 
p 1 ,...,p c ~ Dirichlet^ /C, ....,7/C) 

By integrating over the Dirichlet prior, we can eliminate mixing proportions, p c , and obtain the 
following conditional distribution for q: 

P(ci = cci,...,Cj_i) = - — — (2) 

1 — 1 + 7 

Here, nj c represents the number of data points previously (i.e., before the i t/l ) assigned to component 
c. The probability of assigning each component to the first data point is 1/C. As we proceed, this 
probability becomes higher for components with larger numbers of samples (i.e., larger nj c ). 

When C goes to infinity, the conditional probabilities ([2]) reach the following limits: 

P[(H = C\Cl, ...,Cj_l) - 



P(ci / CjVj < i\ci,...,Ci-i) 



i — 1 + 7 

7 (3) 



i — 1 + 7 

As a result, the conditional probability for 0j, where 0j = Ci , becomes 



Ew+Hr- Go (4) 

1 + 7 « — 1 + 7 



7j Pi, ~ , 

z — 1 + 7 ? — 1 + 7 
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where 6(9) is a point mass distribution at 6. This is equivalent to the conditional probabilities 
implied by the Dirichlet process mixture model, which has the following form: 



Vi\6i 
6i\G 
G 



m) 

G 



(5) 



That is, the limit of the finite mixture model (pQ) is equivalent to the Dirichlet process mixture 
model (0) as the number of components goes to infinity. G is the distribution over 0's, and has a 
Dirichlet process prior, T>. The parameters of the Dirichlet process prior are Go, a distribution from 
which 0's are sampled, and 7, a positive scale parameter that controls the number of components in 
the mixture, such that a larger 7 results in a larger number of components. Phrased this way, each 
data point, i, has its own parameters, 6i, drawn from a distribution that is drawn from a Dirichlet 
process prior. But since distributions drawn from a Dirichlet process are discrete (almost surely), 
the &i for different data points may be the same. 
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Bush and MacEachernl ()1996l ). Escobar and Westl (|199a ). iMacEachern and Mullerl (|1998), and 



(|2000l ) have used this method for density estimation. iMiiller et all (|1996l ) used Dirichlet 
process mixtures for curve fitting. They model the joint distribution of data pairs (xi,yi) as a 
Dirichlet process mixture of multivariate normals. The conditional distribution, P(y\x), and the 
expected value, E(y\x), are estimated based on this distribution for a grid of x's (with interpolation) 



to ob tain a nonparametric curve. The application of this approach (as presented bv lMiiller et al. 



1996) is restricted to continuous variables. Moreover, this model is feasible only for problems with 
a small number of covariates, p. For data with moderate to large dimensionality, estimation of 
the joint distribution is very difficult both statistically and computationally. This is mostly due to 
the difficulties that arise when simulating from the posterior distribution of large full covariance 
matrices. In this approach, if a mixture model has C components, the set of full covariance matrices 
have Cp(p+l)/2 parameters. For large p, the computational burden of estimating these parameters 
might be overwhelming. Estimating full covariance matrices can also cause statistical difficulties 
since we need to assure that covariance matrices are positive semidefinite. Conj ugate priors based 
the in verse Wishart distribution satisfy this requirement, but they lack flexibility ([Daniels and Kass . 
19991 ) . Flat priors may not be suitable either, si nce they can lead to impro per posterior distributions, 
and they can be unintentionally informative (j Daniels and Kassl . Il999l ). A common approach to 
address these issu es is to use decomposition methods in specifying p riors for full covariance matrices 
(see for example, iDaniels and Kassl . [1999J; ICai and Dunsonl . 120061 ) . Although this approach has 
demonstrated some computational advantages over direct esti mation of full covariance matrices, it 
is not yet feasible for high-dimensional variables. For example. ICai and Dunsonl (|2006l ) recommend 
their approach only for problems with less than 20 covariates. 

We introduce a new nonlinear Bayesian model, which also non-parametrically estimates the joint 
distribution of the response variable, y, and covariates, x, using Dirichlet process mixtures. Within 
each component, we assume the covariates are independent, and model the dep endence between y 
and x using a linear model. Therefore, unlike the method of Miiller et al. ( 19961 ). our approach can 
be used for modeling data with a large number of covariates, since the covariance matrix for one 
mixture component is highly restricted. Moreover, this method can be used for categorical as well 
as continuous response variables by using a generalized linear model instead of the linear model of 
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each component. 

Our focus in this paper is on classification models with a multi-category response. We also 
show how our method can be extended to classification problems where classes have a hierarchical 
structure, and to problems with multiple sources of information. The next section describes our 
methodology. In Section 3, we illustrate our approach and evaluate its performance based on 
simulated data. In Section 4, we present the results of applying our model to an actual classification 
problem, which attempts to identify the folding class of a protein sequence based on the composition 
of its amino acids. Folding classes of protein have a hierarchical structure. In Section 5, we extend 
our approach to classification problems of this sort where a class hierarchy is available, and evaluate 
the performance of this new model on the protein fold recognition dataset. Section 6 shows how 
this approach can be used for multiple sources of information. Finally, Section 7 is devoted to 
discussion, future directions and limitations of the proposed method. 



2 Methodology 

Consider a classification problem with continuous covariates, x = (x\, x p ), and a categorical 
response variable, y, with J classes. To model the relationship between y and x, we model the 
joint distribution of y and x non-parametrically using Dirichlet process mixtures. Within each 
component of the mixture, the relationship between y and x is assumed to be linear. The overall 
relationship becomes nonlinear if the mixture contains more than one component. This way, while 
we relax the assumption of linearity, the flexibility of the relationship is controlled. Our model has 
the following form: 

yi,xn,...,Xip\ei ~ F(6i) 
Oi\G ~ G 

G ~ Z>(G ,7) 

where i = l,...,n indexes the observations, and I = l,...,p indexes the covariates. In our model, 
9 = (fj,, a, a, (3), and the component distributions, F(6), are defined based on P(y, x) = P(x)P{y\x) 
as follows: 

x a ~ N(m,af) 

m ., m exp(ay +x i p j ) 
P(Vi= 3\Xi,a,(3) = 



Ej'=i exp(a 3 v +x i (3 j ,) 

Here, the parameters fi = (/xi, fi p ) and a = (a±,...,a p ) are the means and standard deviations 
of covariates in each component. The component index, c, is omitted for simplicity. Within a 
component, a = (a±, ...,aj), and (3 = (/3i, (3j) are the parameters of the multinomial logit 
(MNL) model, and J is the number of classes. The entire set of regression coefficients, /3, can 
be presented as a p x J matrix. This representation is redundant, since one of the (3^8 (where 
j = 1, J) can be set to zero without changing the set of relationships expressible with the model, 
but removing this redundancy would make it difficult to specify a prior that treats all classes 
symmetrically. In this parameterization, what matters is the difference between the parameters of 
different classes. 
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Although the covariates in each component are assumed to be independent with normal priors, 
this independence of covariates exists only locally (within a component). Their global (over all 
components) dependency is modeled by assigning data to different components (i.e., clustering). 
The relationship between y and x within a component is captured using an MNL model. Therefore, 
the relationship is linear locally, but nonlinear globally. 

We could assume that y and x are independent within components, and capture the dependence 
between the response and the covariates by clustering too. However, this may lead to poor per- 
formance (e.g., when predicting the response for new observations) if the dependence of y on x is 
difficult to capture using clustering alone. Alternatively, we could also assume that the covariates 
are dependent within a component. For co ntinuous response variables, this becomes equivalent to 



the model proposed by iMiiller et al\ ()1996l ). However, as we discussed above, this approach may 
be practically infeasible for problems with a moderate to large number of covariates. We believe 
that our method is an appropriate compromise between these two alternatives. 

We define Go as follows: 

Hl\HQ,(TQ ~ N(fMo,Go) 

\og{af)\M a ,V a ~ N(M a ,V 2 ) 
aj\r ~ N(0,t 2 ) 

The parameters of Go may in turn depend on higher level hyperparameters. For example, we can 
regard the variances of coefficients as hyperparameters with the following priors: 

log(r 2 )|M T ,F r ~ N(M T ,V?) 
\og{v 2 )\M u ,V p ~ N(M V ,V*) 

We use MCMC algorithms for posterior sampling. Samples simulated from the posterior distri- 
bution are used to estimate posterior predictive probabilities. We predict the response values for 
new cases based on these probabilities. For a new case with covariates x' , the posterior predictive 
probability of response variable, y', is estimated as follows: 

Pyy =j\x) = — — 

where 

s 

P(y' = j,x') = -J2P(y' = j,x'\Go,o {s) ) 

b s=l 



s , 

s=l 



Here, S is the number of post-convergence samples from MCMC, and represents the set of 
parameters obtained at iteration s. 



Neall (|2000l ) presented several possible algorithms for sampling from the posterior distribution 



of Dirichlet process mixtures. In this research, we use Gibbs sampling with auxiliary parameters 
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Figure 1: An illustration of our model for a binary (black and white) classification problem with 
two covariates. Here, the mixture has two components, which are shown with circles and squares. 
In each component, an MNL model separates the two classes into "black" or "white" with a linear 
decision boundary. 



(Neal 's algorithm 8). This approach is similar to the algorithm proposed bv lMacEachern and Miiller 
(jl998l ). with a difference that the auxiliary parameters exist only temporarily. To improve the 
MCMC sampling, after each update using auxiliary variables, we update the component pa- 
rameters usin g the ir corre sponding data points. For a complete description of this method, see 
the paper bv iNeall tod ). All our models are coded in MATLAB and are available online at 



http : //www.utstat .utoronto . ca/~babak 



In Figure HJ we show a state from an MCMC simulation for our model in which there are 
two covariates and the response variable is binary. In this iteration, our model has identified 
two components (circles and squares). Within a component, two classes (stars and crosses) are 
separated using an MNL model. Note, the decision boundaries shown are component specific. The 
overall decision boundary, which is a smooth function, is not shown in this figure. In our approach, 
division of the data into components and fitting of MNL models are performed simultaneously. 



3 Results for synthetic data 

In this section, we illustrate our approach, henceforth called dpMNL, using synthetic data. We 
compare our model to a simple MNL model, an MNL model with quadratic terms (i.e., xix^, where 
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I = 1, ...,p and = referred to as qMNL, and a decision tree model (jBreiman et all Il993l ) 

that uses 10-fold cross-validation for pruning. For the simple MNL model, we use both Bayesian 
and maximum likelihood estimation. The models are compared with respect to their accuracy rate 
and the F\ measure. Accuracy rate is defined as the percentage of the times the correct class is 
predicted. F± is a common measurement in machine learning and is defined as: 



2Ai 



1 J 
Fx = -Y- 

where Aj is the number of cases which are correctly assigned to class j, Bj is the number cases 
incorrectly assigned to class j, and Cj is the number of cases which belong to the class j but are 
assigned to other classes. 

We do two tests. In the first test, we generate data according to the dpMNL model. Our 
objective is to evaluate the performance of our model when the distribution of data is comprised of 
multiple components. In the second test, we generate data using a smooth nonlinear function. Our 
goal is to evaluate the robustness of our model when data actually come from a different model. 

For the first test, we compare the models using a synthetic four- way classification problem with 
5 covariates. Data are generated according to our model with Go being the following prior: 



2\ 



l<>g(<Tl 

log(r 2 ) 
\og{y 2 ) 



N(0,1) 
N(0,2 2 ) 
iV(0,0.1 2 ) 
iV(0,2 2 ) 



Note that ctj\r 



N(0,t 2 ), and p jt \ v ~ N(0, v ), where I = 1, 5 and j = 1, 4. From the above 
baseline prior, we sample two components, 6\ and 62, where 9 = (fj,,a,r],u,a,l3). For each 6, we 
generate 5000 data points by first drawing xu ~ iV(/i/, 07) and then sampling y using the following 
MNL model: 



P(y = j\x,a,P) 



exp(aj + x(3j) 
E/'=i ex P(«j' +x(3j,) 



The overall sample size is 10000. We randomly split the data to the training set, with 100 data 
points, and test set, with 9900 data points. We use the training set to fit the models, and use 
the independent test set to evaluate their performance. The regression parameters of the Bayesian 
MNL model with Bayesian estimation and the qMNL model have the following priors: 



aj\r 


~ N(0,t 2 ) 




- iV(0,z/ 2 ) 


log (rj) 


~ iV(0,l 2 ) 


logH 


~ iV(0,2 2 ) 



To fit the decision tree models (jBreiman et all Il993l ) , we used the available functions in MAT- 



LAB. These functions are treefit, treetest (for cross-validation) and treeprune. 
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Model 


A PP1 1 TJ} PV f % I 


Fi (%) 


Baseline 


40.0 ( 


1 K A Q 

Io.4o 


l\/T \l I i I I\/T P"vi ttii i in 1 ,i Irol lhnnn 1 


77 SO 


66 65 


MNL 


78.39 


66.52 


qMNL 


83.60 


74.16 


Tree (Cross Validation) 


70.87 


55.82 


dpMNL 


89.21 


81.00 



Table 1: Simulation 1: the average performance of models based on 50 simulated datasets. The 
Baseline model assigns test cases to the class with the highest frequency in the training set. 



The above procedure was repeated 50 times. Each time, new 9\ and 02 were sampled fro m the 
prior , and a new dataset was created based on these #'s. We used Hamiltonian dynamics (jNeall . 



19931 ) for updating the r egression pa rameters, a's and /3's. For all other parameters, we used single- 



variable slice sampling (jNeall . 120031 ) with the "stepping out" procedure to find an interval around 
the current point, and then the "shrinkage" procedure to sample from this interval. We also used 
slice sampling for updating the concentration parameter 7, where log(7) ~ N(— 3,2 2 ). This prior 
encourages smaller values of 7, which results in smaller number of c omponents . Note that the like- 
lihoo d for 7 depends only on C, the number of unique components (Neal, 200ol ; Escobar and West . 



19951 ). For all models we ran 5000 MCMC iterations to sample from the posterior distributions. 



We discarded the initial 500 samples and used the rest for prediction. 

The average results (over 50 repetitions) are presented in Table [TJ As we can see, our dpMNL 
model provides better results compared to all other models. The improvements are statistically 
significant (p-values < 0.001 based accuracy rates) using a paired i-test with n = 50. 

Since the data were generated according to the dpMNL model, it is not surprising that this 
model had the best performance compared to other models. In fact, as we increase the number 
of components, the amount of improvement using our model becomes more and more substantial 
(results not shown). To evaluate the robustness of the dpMNL model, we performed another test. 
This time, we generated xn , x^ , x^ (where i = 1,..., 10000) from the Uniform(0,5) distribution, 
and generated a binary response variable, yi, according the following model: 

P(y = l\x) ' 



1 + exp[ai sin(x}- 04 + 1.2) + x\ cos(a2X2 + 0.7) + 03X3 — 2] 

where 01, 02 and 03 are randomly sampled from iV(l,0. 5 2 ). The function used to generate y is a 
smooth nonlinear function of covariates. The covariates are not clustered, so the generated data 
do not conform with the assumptions of our model. Moreover, this function includes a completely 
arbitrary set of constants to ensure the results are generalizable. Figure [2] shows a random sample 
from this model for 03 = 0. In this figure, the dotted line is the optimal decision boundary. 

We generated 50 datasets (n = 10000) using the above model. Each time, we sampled new 
covariates, x, new constant values, 01,02,03, and new response variable, y. As before, models were 
trained on 100 data points, and tested on the remaining samples. The average results over 50 
datasets are presented in Table [2 As before, the dpMNL model provides significantly (all p- values 
are smaller than 0.001) better performance compared to all other models. This time, however, the 
performance of the qMNL model is closer to the results from the dpMNL model. 
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Figure 2: A random sample generated according to Simulation 2 with 03 = 0. The dotted line is 
the optimal boundary function. 



Model 


Accuracy (%) 


Fx (%) 


Baseline 


61.96 


37.99 


MNL (Maximum Likelihood) 


73.58 


68.33 


MNL 


73.58 


67.92 


qMNL 


75.60 


70.12 


Tree (Cross Validation) 


73.47 


66.94 


dpMNL 


77.80 


73.13 



Table 2: Simulation 2: the average performance of models based on 50 simulated datasets. The 
Baseline model assigns test cases to the class with the highest frequency in the training set. 



4 Results for protein fold classification 

In this section, we consider the problem of predicting a protein's 3D structure (i.e., folding class) 
based on its sequence. For this problem, it is common to presume that the number of possible folds is 
fixed, and use a classification model to assign a protein to one of the folding classes. There are more 
than 600 folding patte rns identified in the SCOP (Structural Classification of Proteins) database 
(|Lo Conte et all 12000 ) . In this database, proteins are considered to have the same folding class if 
they have the same major secondary structure in the same arrangement with the same topological 
connections. 



We apply our model to a protein fold recognition dataset provided bv lDing and Dubchak (2001). 



The proteins in this dataset are obtained from the PDB_select database (jHobohm et all LL992J; 



9 



Hob ohm and Sanded . 11994 ) such that two proteins have no more than 35% of the sequence identity 
for aligned subsequences larger than 80 resi d ues. Originally, the resulting dataset included 128 
unique folds. However, iDing and Dubchakl (|200ll ) selected only 27 most populated folds (311 
proteins) for their analysis. They evaluated their m odels based on an independent sample (i.e., test 
set) obtained from PDB-40d Ilo Conte etaU (hood). PDB-4 0D contains the SCOP sequences with 
less than 40% identity with each other. IDing and Dubchakl (|200ll ) selected 383 representatives of 
the same 27 folds in the training set with no more than 35% identity to the training sequences. The 
training and test datasets are available online at http://crd.lbl.gov/~cding/protein/, These 
datasets include the length of protein sequences, and 20 other covariates based on the percenta ge 
composition of different amino acids. For a detail description of data, see lDubchak et al\ (|1995l ). 



Ding and Dubchakl (|200ll ) trained several Support Vector Machines (SVM) with nonlinear kernel 



functions, and Neural Networks (NN) with different architecture on this dataset. They also tried 
different classification schemes, namely, one versus others (OvO), unique one versus others (uOvO), 
and all versus all (AvA). The details for these methods can be found in their paper. The performance 
of these models on the test set is presented in Table [3j 

We first centered the covariates so they have mean 0. We trained our MNL and dpMNL on the 
training set, and evaluated their performance on the test set. For these models, we used similar 
priors as the ones used in the previous section. However, the hyperparameters for the variances of 
regression parameters are more elaborate. We used the following priors for the MNL model: 



tXj\V 
log(?7 2 ) 

log(£ 2 ) 
log(af) 



iV(0,77 2 ) 
iV(0,2 2 ) 

N(0,1) 
iV(-3,4 2 ) 



Here, one hyperparameter, 07, is used to control the variance of all coefficients, 0ji (where j = 
1,..., J), for covariate x\. If a covariate is irrelevant, its hyperparameter will tend to be small, 
forcing the coefficients for that covariate to be near zero. T his method is called Automatic Relevance 
Determination (ARD), and was suggested by iNeall (|1996l ). We also used another hyperparameter, 
£, to control the overall magnitude of all /3's. This way, 07 controls the relevance of covariate xi 
compared to other covariates, and £ controls the overall usefulness of all covariates in separating 
all classes. The standard deviation of f3ji is therefore equal to £07. 

We used the same scheme for the MNL models in dpMNL. Note that, in this model one 07 
controls all (3ji c , where j = 1, J indexes classes, and c = 1, C indexes the unique components 
in the mixture. Therefore, the standard deviation of f3ji c is £<7{i/ c . Here, v c is specific to each 
component c, and controls the overall effect of coefficients in that component. That is, while a and 
£ are global hyperparameters common between all components, v c is a local hyperparameter within 
a component. Similarly, the standard deviation of intercepts, ay c in component c is tjt c . We used 
iV(0, 1) as the prior for v c and r c . 



We also needed to specify priors for m and 07 , the mean and standard deviation of covariate x\ , 
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Model 


i_V v.- LA-L Cl'Vj y I / (J I 


Fi (%) 


MTV ( \ -x ■ ( \ 


on ^ 




SVM-OvO 

O V 1VJ- V / V V/ 


i "J . (J 




SVM-uOvO 


49.4 




SVM-AvA 


44.9 




MNL 


50.0 


41.2 


dpMNL 


58.6 


53.0 



Table 3: Performance of models based on protein f old classification data. N N and SVM use 
maximum likelihood estimation, and are developed by Ding and Dubchak ( 200ll ). 



where I = 1, For these parameters, we used the following priors: 



Mo,; 



log(<rj 



o.i) 



log(al)\M a ,i,V a ,i 



M, 



a,l 



^g(V 2 



N(0,5 2 ) 
N(0,2 2 ) 

N(0,l 2 ) 
iV(0,2 2 ) 



As we can see, the priors depend on higher level hyperparameters. This provides a more flexi- 
ble scheme. If, for example, the components are not different with respect to covariate X[, the 
corresponding variance, a^ t , becomes small, forcing close to their overall mean, /i ,z- 

For each of our Bayesian models discussed in this section (and also in the following sections), 
we performed four simultaneous MCMC simulations each of size 10000. The chains have different 
starting values. We discarded the first 1000 samples from each chain and used the remaining 
samples for predictions. For this problem, running multiple chains results in faster and more 
efficient sampling. Simulating the Markov chain for 10 iterations took about half a minute for 
MNL, and about 3 minutes for dpMNL, using a MATLAB implementation on an UltraSPARC III 
machine. 

The results for MNL and dpMNL models are presented in Table [3l As a benchmark, we also 
present the results for the SVM and NN models developed by iDing and Dubchak! (|200ll ) on the 
exact same dataset. As we can see, our li near MNL model provides better accuracy rate compared 
to the SVM and NN models developed bv lDing and Dubchak! (|200ll ). Our dpMNL model provides 
an additional improvement over the MNL model. This shows that there is in fact a nonlinear 
relationship between folding classes and the composition of amino acids, and our nonlinear model 
could successfully identify this relationship. 

It is worth noting the performance of the NN models is influenced by many design choices, and 
by model assumptions. We found th at Bayesian neural n etwor ks model (jNeall . 1 19961 ) had better 
performance than the NN model of IDing and Dubchak! (|200ll ). Our NN model performs very 
similarly to the performance of the dpMNL model. 
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Class 1 Class 2 

Pi = 011 + 021 Pi = 011 + 022 



Class 3 Class 4 

Ps = 012 + 023 Pa = 012 + 024 



Figure 3: A simple representation of our hierarchical classification model. 



5 Extension to hierarchical classes 

In the previous section, we modeled the folding classes as a set of unrelated entities. However, 
these classes are not completely u nrelated, and can b e grou ped into four major structural classes 
known as a, j3, a/ (5, and a + j3. iDing and Dubchakl (|200ll ) show the corresponding hierarchical 
scheme (Table 1 in their paper). We have previou sly introduced a new approach for modeling 



hierarchical classes ( Shahbaba and Neal 



2006 



20071 ). In this approach, we use a Bayesian form of 



the multinomial logit model, with a prior that introduces correlations between the parameters for 
classes that are nearby in the hierarchy. 

Figure illustrates this approach using a simple hierarchical structure. For each branch in the 
hierarchy, we define a different set of parameters, 0. Our model classifies objects to one of the end 
nodes using an MNL model whose regression coefficients for class j are represented by the sum of 
the parameters for all the branches leading to that class. Sharing of common parameters (from 
common branches) introduces prior correlations between the parameters of nearby classes in the 
hierarchy. We refer to this model as corMNL. 

In this section, we extend our nonlinear model to classification problems where classes have a 
hierarchical structure. For this purpose, we use a corMNL model, instead of MNL, to capture the 
relationship between the covariates, x, and the response variable, y, within each component. The 
results is a nonlinear model which takes the hierarchical structure of classes into account. We refer 
to this models as dpCorMNL. 

Table H] presents the results for the two linear models (with and without hierarchy-base priors), 
and two nonlinear models (with and without hierarchy-based priors). In this table, "parent accu- 
racy" refers to the accuracy of models based on the four major structural classes, namely a, f3, a/ (3. 
When comparing the hierarchical models to their non-hierarchical counterparts, the advantage of 
using the hierarchy is apparent only for some measures (i.e., parent accuracy rate for corMNL, 
and the F\ measure for dpCorMNL). As we can see, however, the dpCorMNL model provides a 
substantial improvement over corMNL. 
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Model 


Accuracy (%) 


Parent accuracy (%) 


Fi (%) 


MNL 


50.0 


76.5 


41.2 


corMNL 


49.5 


77.9 


41.4 


dpMNL 


58.6 


79.9 


53.0 


dpCorMNL 


59.1 


79.4 


55.2 



Table 4: Comparison of hierarchical models (linear and nonlinear) with non-hierarchical models 
(linear and nonlinear) based on protein fold classification data. 



Model 


Accuracy (%) 


Parent accuracy (%) 


Fx (%) 


NN-OvO 


41.4 






SVM-OvO 


43.2 






SVM-uOvO 


49.4 






SVM-AvA 


56.5 






MNL 


56.5 


80.4 


51.4 


corMNL 


59.6 


83.3 


54.6 


dpMNL 


60.4 


82.0 


55.9 


dpCorMNL 


61.4 


83.8 


57.8 



Table 5: Comparison of hierarchical models (linear and nonlinear) with non-hierarchical models 
(linear and nonlinear) based on protein fold classification data. The covariates are obtained from 
four different feature sets: composition of amino acids, predicted secondary structure, hydropho- 
bic! ty, and normalized van der Waals volume. 



6 Extension to multiple datasets 



In order to improve the prediction of folding classes for proteins, iDing and Dubchakl (|200ll ) com- 
bined the feature set based on amino acid compositions with 5 other feature sets, which were inde- 
pendently extracted based on various physico-chemical and structural properties of amino acids in 
the sequence. The additional features predicted secondary structure, hydrophobicity, normalized 
varn der Waals volume, polarity, and p qlarizability. Each da ta source has 21 covariates. For a 
detailed description of these features, see lDubchak et al\ (|1995l ). IDing and Dubchakl (|200ll ) added 
the above 5 datasets sequentially to the amino acid composition dataset. For prediction, they used 
a majority voting system, in which the votes obtained from models based on different features sets 
are combined, and the class with the most votes is regarded as predicted fold. Their results show 
that adding additional feature sets can improve the performance in some cases and can result in 
lower performance in some other cases. One main issue with this method is that it gives equal 
weights to votes based on different data sources. The underlying assumption, therefore, is that the 
quality of predictions is the same for all sources of inform ation. This is, of course, not a realistic 
assumption for many real problems. In our previous paper (jShahbaba and Neall . 120061 ) . we provided 
a new scheme for combining different sources of information. In this approach, we use separate 
scale parameters, £;, for each data source in order to adjust their relative weights automatically. 
This allows the coefficients from different sources of data to have appropriately different variances 
in the model. 



For models developed bv lDing and Dubchakl (|200ll ). the highest accuracy rate, 56.5, was achieved 
only when they combined the covariates based on the composition of amino acids, secondary struc- 
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ture, hydrophobicity, and polarity. We also used these four datasets, and applied our models to 
the combined data. We used a different scale parameters, £, for each dataset. The results from 
our models are pre s ented in Table For comparison, we also present the results obtained by 



Ding and Dubchakl (|200ll ) based on the same datasets. As we can see, this time, using the hi- 



erarchy results in more substantial improvements. Moreover, nonlinear models provided better 
performance compared to their corresponding linear models. 



7 Conclusions and future directions 

We introduced a new nonlinear classification model, which uses Dirichlet process mixtures to model 
the joint distribution of the response variable, y, and the covariates, x, non-parametrically. We 
compared our model to several linear and nonlinear alternative methods using both simulated and 
real data. We found that when the relationship between y and x is nonlinear, our approach provides 
substantial improvement over alternative methods. One advantage of this approach is that if the 
relationship is in fact linear, the model can easily reduce to a linear model by using only one 
component in the mixture. This way, it avoids overfitting, which is a common challenge in many 
nonlinear models. 

We believe our model can provide more interpretable results. In many real problems, the iden- 
tified components may correspond to a meaningful segmentation of data. Since the relationship 
between y and x remains linear in each segment, the results of our model can be expressed as a set 
of linear patterns for different segments of data. 

As mentioned above, for sampling from the posterior distribution, we used multiple chains which 
appeared to be sampling different regions of the posterior space. Ideally, we prefer to have one 
chain that can efficiently sample from the whole posterior distribution. In future, we intend to 
improve our MCMC sampling. For t his purpose, we can u se more efficient methods, such as the 



'split-merge" approach in troduced by I Jain and Neall (|2007l ) and the short-cut Metropolis method 



introduced bv lNeall (|2005l ). 



In this paper, we considered only continuous covariates. Our approach can be easily extended to 
situations where the covariate are categorical. For these problems, we need to replace the normal 
distribution in the baseline, Go, with a more appropriate distribution. For example, when the 
covariate x is binary, we can assume x ~ Bernoulli(fi), and specify an appropriate prior distribution 
(e.g., Beta distribution) for [i. Alternatively, we can use a continuous latent variable, z, such that 
fj, = exp(z)/{l + exp(2;)}. This way, we can still model the distribution of z as a mixture of normals. 
For covariates with multinomial distribution, we can either extend the Bernoulli distribution by 
using (/ii, ...,(ak), where K is the number of categories in x, or use K continuous latent variables, 
21, ...,z K , and set 9j = exp{zj)/J2j' ex P(4)- 

Our model can also be extended to problems where the response variable is not multinomial. 
For example, we can use this approach for regression problems with continuous response, y. The 
distribution of y can be assumed normal within a component. We model the mean of this normal 
distribution as a linear function of covariates for cases that belong to that component. Other types 
of response variables (i.e., with Poisson distribution) can be handled in a similar way. 

Finally, our approach provides a convenient framework for semi-supervised learning, in which 
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both labeled and unlabeled data are used in the learning process. In our approach, unlabeled data 
can contribute to modeling the distribution of covariates, x, while only labeled data are used to 
identify the dependence between y and x. This is a quite useful approach for problems where the 
response variable is known for a limited number of cases, but a large amount of unlabeled data can 
be generated. One such problem is classification of web documents. In future, we will examine the 
application of our approach for these problems. 
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